library(ggplot2)
p <- ggplot(data = mtcars, aes(x = mpg))
p
Histogram
Concepts
Purpose: A histogram is a bar-style chart that shows the distribution of numerical data by dividing the range of values into bins (intervals). Each bin covers a range of values, and the height of its bar shows the frequency (count) of data points that fall in that range. Histograms are used to visualize the shape, central tendency, and variability of a data set and to spot outliers.
Bins and Intervals: Bins (also known as intervals, classes, or buckets) determine how a histogram looks and how easily it can be read. Too many bins make the plot noisy, because each bar reflects only a handful of observations; too few bins oversimplify the data’s structure and can hide features such as multiple peaks. Bins are usually of equal width, although they do not have to be.
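To see how much the bin choice matters, the sketch below draws the same variable with three bin widths. It assumes the built-in mtcars data used elsewhere on this page; the helper name plot_with_binwidth and the widths 1, 5, and 10 are illustrative choices, not recommendations.

```r
library(ggplot2)

# Hypothetical helper: same data, different bin width
plot_with_binwidth <- function(width) {
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = width, fill = "lightblue", color = "black") +
    labs(title = paste("binwidth =", width))
}

p_narrow <- plot_with_binwidth(1)   # many narrow bins: noisy
p_medium <- plot_with_binwidth(5)   # the width used later on this page
p_wide   <- plot_with_binwidth(10)  # few wide bins: detail is smoothed away

# layer_data() returns the bins ggplot2 computed, one row per bar
bins_narrow <- layer_data(p_narrow)
bins_wide   <- layer_data(p_wide)
nrow(bins_narrow)  # number of bins at width 1
nrow(bins_wide)    # number of bins at width 10
```

Every version counts the same observations; only the grouping changes, which is why it is worth trying a few widths before settling on one.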
Frequency vs. Density: Histograms can display either frequency (the count of observations in each bin) or density (the proportion of observations in each bin divided by the bin width, so that the bar areas sum to 1). Frequency histograms show how many observations fall in each range, while density histograms are better for comparing distributions from samples of different sizes.
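As a minimal sketch of the difference, the code below builds a frequency histogram and a density histogram from the same data and then checks the area of the density version. It assumes ggplot2 3.3.0 or later (for after_stat()) and reuses the mtcars data; the object names are illustrative.

```r
library(ggplot2)

# Frequency: bar height is the number of observations in the bin (the default)
p_count <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black")

# Density: bar height is count / (n * bin width), so the bar areas sum to 1
p_density <- ggplot(mtcars, aes(x = mpg, y = after_stat(density))) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black")

# Check the total area of the density histogram: height times width, summed
bins <- layer_data(p_density)
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area
```

Because the area is normalized, two density histograms can be compared directly even when one sample is much larger than the other.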
Shape of the Distribution: The shape of a histogram gives insight into the underlying distribution of the data. Common shapes include symmetric, skewed (left or right), bimodal (two peaks), and uniform (flat). Identifying the shape helps in understanding the data’s characteristics and the processes that generated it.
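The shapes named above can be illustrated with simulated data. The following is only a sketch: the seed, the sample size, and the distribution parameters are arbitrary assumptions chosen to make each shape easy to recognize.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the simulated samples are reproducible
n <- 500      # arbitrary sample size per shape

shapes <- data.frame(
  shape = rep(c("symmetric", "right-skewed", "bimodal", "uniform"), each = n),
  value = c(
    rnorm(n, mean = 0, sd = 1),                          # symmetric
    rexp(n, rate = 1),                                   # right-skewed
    c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),  # bimodal mixture
    runif(n, min = -3, max = 3)                          # uniform
  )
)

# One panel per shape; free scales because the ranges differ
p_shapes <- ggplot(shapes, aes(x = value)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  facet_wrap(~ shape, scales = "free")
p_shapes
```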
Outliers and Gaps: Histograms can help identify outliers, gaps, or unusual patterns in the data. Large gaps between bars indicate ranges with no observations, and isolated bars far from the bulk of the data may represent outliers. Bars that are much taller than their neighbors more often point to clusters or to values that occur unusually often. These features can prompt further investigation into the data and its sources.
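Gaps can also be located programmatically from the bins that ggplot2 computes. This sketch assumes the mtcars data and a deliberately narrow bin width of 1 so that empty bins are likely to appear; the object names are illustrative.

```r
library(ggplot2)

p_fine <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black")

# Bins computed by the histogram, one row per bar
bins <- layer_data(p_fine)

# Bins with a count of zero are ranges that contain no observations
gaps <- bins[bins$count == 0, c("xmin", "xmax", "count")]
gaps
```

An empty bin is only a hint: with a small sample and narrow bins, gaps can appear by chance, so it is worth checking whether they persist at a wider bin width.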
Histogram - Layer by layer
1. Data Layer
First, define the dataset you are going to use and map your aesthetics; for a histogram, this means specifying the single numeric variable to plot on the x-axis.
2. Geometric Layer
Add the geometric object that represents the type of plot you want to create. For histograms, this is geom_histogram().
# Add the geometric layer
p <- p + geom_histogram(binwidth = 5, fill = "lightblue", color = "black") # binwidth sets the width of each bin
p
3. Scale Layer
The scale layer controls how data values are converted into visual properties. For histograms, you might customize the x (bins) and y (counts) scales.
# Add the scale layer
p <- p + scale_x_continuous(
  name = "Miles per Gallon",  # Custom x-axis label
  limits = c(5, 40),          # Set x-axis limits
  breaks = seq(5, 40, 2.5)    # Set x-axis breaks
)
p <- p + scale_y_continuous(
  name = "Frequency",         # Custom y-axis label
  limits = c(0, 15),          # Set y-axis limits
  breaks = seq(0, 15, 3)      # Set y-axis breaks
)
p
4. Coordinate System Layer
While the default Cartesian coordinates are generally suitable for histograms, you have the option to adjust them if necessary.
# Optional: Add coordinate system adjustments
# p <- p + coord_cartesian(xlim = c(a, b), ylim = c(c, d))
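One reason to reach for coord_cartesian() is that it zooms the view without discarding data, whereas limits set on a scale remove out-of-range values before the bins are computed. The sketch below contrasts the two; the 15 to 30 mpg window and the object names are illustrative assumptions.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black")

# Zoom: all data is binned, then the view is cropped to 15-30 mpg
p_zoom <- base + coord_cartesian(xlim = c(15, 30))

# Scale limits: values outside 15-30 mpg are removed before binning
# (ggplot2 warns about the removed rows when the plot is built)
p_limits <- base + scale_x_continuous(limits = c(15, 30))

# Compare how many observations each version actually counts
n_zoom   <- sum(layer_data(p_zoom)$count)
n_limits <- sum(suppressWarnings(layer_data(p_limits))$count)
n_zoom    # every car is still counted
n_limits  # cars outside the limits were dropped
```

If the goal is only to focus on part of the axis, zooming is usually the safer choice because the bar heights stay the same as in the full plot.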
5. Theme Layer
Customize the non-data appearance of your plot using themes.
# Add the theme layer
p <- p + theme_minimal()
p
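Beyond choosing a complete theme, individual elements can be adjusted with theme(). This is a sketch of one possible set of tweaks; the base font size, colors, and the object name p_themed are arbitrary assumptions.

```r
library(ggplot2)

p_themed <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Miles per Gallon") +
  theme_minimal(base_size = 14) +                 # start from a complete theme
  theme(
    plot.title = element_text(face = "bold"),     # bold plot title
    panel.grid.minor = element_blank(),           # drop minor grid lines
    axis.title = element_text(color = "grey30")   # soften the axis titles
  )
p_themed
```

The order matters: a complete theme such as theme_minimal() replaces earlier theme() settings, so element-level tweaks go after it.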
6. Labels Layer
Add titles, subtitles, and axis labels to provide context and information.
# Add labels
p <- p + labs(
  title = "Distribution of Miles per Gallon",
  x = "Miles per Gallon",
  y = "Count"
)
# Add a dotted vertical line at the mean of mpg
# (linewidth replaces size for lines in ggplot2 >= 3.4.0)
mean_mpg <- mean(mtcars$mpg, na.rm = TRUE)
p <- p + geom_vline(xintercept = mean_mpg, linetype = "dotted", color = "red", linewidth = 1)
p